
refactor: consolidate msm and batch-msm implementations to one function #342

Merged: 2 commits merged into dev, Jan 18, 2024

Conversation

@yshekel (Collaborator) commented Jan 16, 2024

refactor: consolidate msm and batch-msm implementations to one function

  • now batch-msm supports parallel BM accumulation in addition to large triangle accumulation

@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch 2 times, most recently from 2234b3f to b6fb5c0 on January 16, 2024 16:03
@yshekel yshekel marked this pull request as ready for review January 16, 2024 16:03
@yshekel yshekel marked this pull request as draft January 16, 2024 16:14
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from b6fb5c0 to 4581424 on January 17, 2024 11:31
    - now batch-msm support parallel BM accumulation in addition to large triangle accumulation
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from 4581424 to 946eb10 on January 17, 2024 14:07
@yshekel yshekel marked this pull request as ready for review January 17, 2024 14:10
@HadarIngonyama (Contributor) left a comment:

Looks good to me apart from the comments. In addition, which tests and benches did you run? We need to make sure that both batched and non-batched MSMs pass and that performance was not affected.

S* scalars,
A* points,
int size,
unsigned batch_size,
Contributor:

The definition of these 3 variables is unclear; please add a description. Also, it seems that only 2 out of the 3 are needed, no?

Collaborator Author:

I will add comments and rename variables to make the code clearer.

Collaborator Author @yshekel, Jan 18, 2024:

I don't think 2 variables are enough:

  • msm_size is the number of elements per MSM sum
  • batch_size is how many MSMs to compute in the batch
  • nof_points can either be msm_size (if sharing points) or msm_size*batch_size (if not sharing points).

Maybe I can replace nof_points with a boolean that tells whether points are shared between batch elements or not. It would still be 3 variables and would require modifying 'MSMConfig', which would break the API, so I prefer avoiding it.

Note that nof_scalars is always assumed to be msm_size * batch_size. Theoretically MSMs could share scalars but not points, but I don't think that's an interesting case, and again I want to stay fully compatible with the current API.
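
For clarity, here is a minimal sketch of the size relations described above (illustrative only; the helper functions and the points_shared flag are not part of MSMConfig or this PR):

```cpp
// Illustrative only: how the sizes discussed above relate to each other.
// msm_size      - number of elements per MSM sum
// batch_size    - number of MSMs computed in the batch
// points_shared - hypothetical flag: are points shared across batch elements?

unsigned expected_nof_scalars(unsigned msm_size, unsigned batch_size)
{
  return msm_size * batch_size; // scalars are never shared in the current API
}

unsigned expected_nof_points(unsigned msm_size, unsigned batch_size, bool points_shared)
{
  return points_shared ? msm_size : msm_size * batch_size;
}
```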

Contributor @DmytroTym, Jan 18, 2024:

I would leave three variables. By design it should be possible to have different points in the first and second MSMs, then re-use points from the first for the third, points from the second for the fourth, and so on...
In this case nof_points = 2 * msm_size.
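
A hypothetical illustration of that layout (not existing code): with nof_points = 2 * msm_size, batch element j could read its points from point set j % 2:

```cpp
// Hypothetical indexing for the point-reuse pattern described above:
// batch elements 0, 2, 4, ... read the first point set,
// batch elements 1, 3, 5, ... read the second point set.
unsigned point_index(unsigned j /* batch element */, unsigned i /* index within the MSM */,
                     unsigned msm_size)
{
  return (j % 2) * msm_size + i; // assumes nof_points == 2 * msm_size
}
```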

@@ -530,7 +542,7 @@ namespace msm {
CHK_IF_RETURN(cudaFreeAsync(sort_single_temp_storage, stream));

// find large buckets
unsigned avarage_size = size / (1 << c);
unsigned avarage_size = msm_size / (1 << c);
Contributor:

making sure that this is the size of a single msm

Collaborator Author:

yes it is. I will rename variables to make the code clearer.

if (is_big_triangle) {
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms, stream));
if (is_big_triangle || c == 1) {
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms * batch_size, stream));
Contributor:

this seems to be used a lot, maybe define total_nof_bms = nof_bms*batch_size

Collaborator Author:

ok

nof_bms = bitsize;
CHK_IF_RETURN(cudaMallocAsync(&final_results, sizeof(P) * nof_bms, stream));
// Note: the reduction ends up with 'target_windows_count' windows per batch element. for batch=1 we can
// only sum the first 'bitsize' elements to avoid a few ECADD which add zeros
Contributor:

why isn't this true for every batch size?

Collaborator Author:

It is true for every batch size.
The thing is that for fields like BLS12-377, where the scalar field is 253b, we end up with >253 bms (e.g. 254 bms for c=2 with 1 iteration of reduction from 127 to 254 bms; for other cases it can be a little more than that), where the most significant ones are guaranteed to be zero.
The original code summed the bms based on bitsize, so it effectively summed only the non-zero bms. That is slightly more efficient than summing all bms (254 in the example I gave). I wrote the code this way for batch=1 since I did not want to end up with worse performance (although the difference may be negligible).

Generally, for batch>1 we must account for all bms so that the other batch elements sum the right bms. If we set bm=253, the second batch element sums the wrong bms and all results end up wrong except for the first batch element.

Do you think I should always set bm based on the final number of windows? It is slightly less efficient but more drastic.

Contributor:

Why not have each MSM sum only the non-zero buckets (and make sure that the buckets are the correct ones)? I'm pretty sure you can use bm=253 and correct the kernel to sum the right buckets. Let's go for best performance.

Collaborator Author:

Yes I think you are right. I will do that.

final_accumulation_kernel<P, S>
<<<1, 1, 0, stream>>>(final_results, is_result_on_device ? final_result : d_final_result, 1, nof_bms, c);
// launch the double and add kernel, a single thread per batch element
NUM_THREADS = 1;
Contributor:

Better to use at least 32 threads (one full warp).

Collaborator Author:

ok

config.are_scalars_montgomery_form, config.are_points_on_device, config.are_points_montgomery_form,
config.are_results_on_device, config.is_async, stream));

unsigned c = config.batch_size > 1 ? ((config.c == 0) ? get_optimal_c(msm_size) : config.c) : 16;
Contributor:

why is 16 the default for a single msm? it should also be the closest power of 2 to the optimal c

Collaborator Author:

It was like that for batch=1, so I did not want to end up with different behavior.

Should I use the closest power of 2 instead? Closest from below? I mean, does 12 go to 8 or to 16?

Contributor:

I don't remember.. use the value that gives better performance:)

Collaborator Author:

I think the optimum is log(N) (meaning the number of ECADD operations is minimized), but implementation-wise powers of two work better from what I read. In addition, memory grows exponentially in c since you have 2^c buckets per BM, so practically c=16 is the largest power of two that is reasonable.

Having said that, the code says:
unsigned get_optimal_c(int bitsize) { return max((unsigned)ceil(log2(bitsize)) - 4, 1U); }

I don't understand why the '-4'. It is quite odd since I then round it back to a power of two.

Anyway, I will try to play with it a little and figure out which values work better, but I think this is an optimization we can dedicate a separate PR and task to, since it's a heuristic and not really part of this PR.
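
For concreteness, a standalone sketch of the heuristic quoted above with a couple of worked values (the parameter is named bitsize in the quoted code but receives msm_size at the call site shown earlier; the "rounded down" notes assume the power-of-two rounding discussed later in this thread; these are not benchmarks):

```cpp
#include <algorithm>
#include <cmath>

// The heuristic quoted above: c ~= ceil(log2(msm_size)) - 4, clamped to at least 1.
unsigned get_optimal_c(int msm_size)
{
  return std::max((unsigned)ceil(log2(msm_size)) - 4, 1U);
}

// Worked values:
//   msm_size = 1 << 20 -> ceil(log2) = 20 -> c = 16 (already a power of two)
//   msm_size = 1 << 16 -> ceil(log2) = 16 -> c = 12 (rounded down to 8)
//   msm_size = 1000    -> ceil(log2) = 10 -> c = 6  (rounded down to 4)
```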

Contributor:

Yes, for now just change 16 to get_optimal_c (which will then be rounded to the closest power of 2).
The logn-4 is the value that minimizes the number of EC additions for most sizes, but it's true that with the actual implementation there are other considerations too. Ideally we will have a dictionary.

Contributor @DmytroTym, Jan 18, 2024:

We discussed in the daily that fine-tuning the value of c is probably out of scope for this PR.
IMO it'll be more appropriate to tabulate the best values of c and the optimal accumulation method (triangle vs. non-triangle) once we establish proper benchmarking and run out of options to significantly speed up the MSM (potentially altering choices that we hard-coded). I think for now it's ok to leave reasonable defaults and give users an option to edit them via config.

Collaborator Author:

I will leave it as is for now then, meaning the behavior remains as it is today.

@yshekel (Collaborator Author) commented Jan 17, 2024:

> Looks good to me apart from the comments. In addition, which tests and benches did you run? We need to make sure that both batched and non-batched MSMs pass and that performance was not affected.

Thanks for the comments, I will fix and push an additional commit.

The PR tests all the curves, 15 tests each, including 3 MSM tests: msm, batch-msm, skewed-msm. They all pass. It's the "Test / Test Rust on Linux" check.

Regarding performance: @DmytroTym (please correct me if wrong) profiled batched MSM with this PR: (a) batch=16, N=2^16 and (b) batch=3, N=1000. For case (a) he measured ~60.7 ms compared to ~413.8 ms for the old implementation that used the triangle accumulation. For case (b) he measured ~12.7 ms compared to the old ~8.9 ms. In that small case the triangle accumulation is slightly better, but of course it is configurable, and maybe a better choice of c can also improve the performance.

I think it would be very useful to have a benchmark to measure such modifications, but AFAIK there isn't one. Is that right?

@HadarIngonyama (Contributor):

I believe there is a "benches" folder, isn't that so?
Also, best to have the MSM use the "big triangle" automatically only for the parameters where it improves performance - have that as the default in the code.
Another check that needs to be done is to verify that the performance of a single MSM (batch_size=1) has not degraded as a result of the new changes.

@yshekel (Collaborator Author) commented Jan 17, 2024:

> I believe there is a "benches" folder, isn't that so? Also, best to have the MSM use the "big triangle" automatically only for the parameters where it improves performance - have that as the default in the code. Another check that needs to be done is to verify that the performance of a single MSM (batch_size=1) has not degraded as a result of the new changes.

I did not see a 'benches' directory. Maybe it was removed.
I will try to see where big-triangle is more efficient in terms of N and batch_size (or maybe N*batch_size) and default to that.
Regarding batch=1 performance, I will verify it did not degrade, but this is basically the exact same implementation, so other than multiplying by 1 in a few places (on the host) I don't think it should differ at all.

@DmytroTym (Contributor) left a comment:

Looks good with just one change request related to rounding c to a power of 2.
@HadarIngonyama, in v1.0.0 we removed benches for now (but we should bring them back ASAP). I did indeed profile some batched MSMs and saw improvements (mostly); the perf of a single MSM is unchanged.

// reduce c to closest power of two if not using big_triangle reduction logic
// TODO: support arbitrary values of c
if (!config.is_big_triangle) {
while ((c & (c - 1)) != 0)
Contributor:

Can't this procedure just hang for some values of c? Like, if c=b101, then c & (c - 1) doesn't add any extra bits, so the process never terminates. Also, since the idea is rounding up, I'm afraid any MSM of size bigger than 2^20 will end up with c=32, which takes way too much memory.

Collaborator Author:

It will never hang, and it rounds down, not up:

  • for powers of two exactly one bit is 1'b1, meaning 'c & (c-1)' is zero. For example, for c=4'b0100, c-1 = 4'b0011.
  • otherwise each iteration masks one bit to 1'b0, making the number smaller, so it must terminate in at most 31 iterations.
  • note that masking c with any number cannot result in a number larger than c, meaning it rounds down.

In your example c=b101, therefore c-1=b100 --> c & (c-1) = b100. Note that the lowest set bit is masked to 1'b0, and the next iteration would mask the next lowest set bit.
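
For reference, a standalone sketch of that rounding loop (the loop body is not visible in the diff above, so the c &= (c - 1) step is inferred from this explanation):

```cpp
#include <cassert>

// Round c down to the nearest power of two by repeatedly clearing the
// lowest set bit until only one bit remains (assumes c > 0).
unsigned round_down_to_pow2(unsigned c)
{
  while ((c & (c - 1)) != 0) // non-zero as long as more than one bit is set
    c &= (c - 1);            // clear the lowest set bit
  return c;
}

int main()
{
  assert(round_down_to_pow2(5) == 4);   // 0b101  -> 0b100
  assert(round_down_to_pow2(12) == 8);  // 0b1100 -> 0b1000
  assert(round_down_to_pow2(16) == 16); // already a power of two
  return 0;
}
```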

Contributor:

oh sorry you're right

@@ -338,59 +340,63 @@ namespace msm {
// this function computes msm using the bucket method
template <typename S, typename P, typename A>
cudaError_t bucket_method_msm(
int bitsize,
int c,
unsigned bitsize,
Contributor:

Just a style-related note: the Google style guide advises basically using int everywhere. I know that we're still not fully following this rule, but we at least tried to do it in higher-level and public functions. Do you think it's a good idea to stick to it?

Contributor:

I know someone who is very experienced with CUDA and he uses uint32_t wherever he can:)

Contributor:

Well, limbs obviously shouldn't be int. I'll just quote the Google guide to be more precise:

> We use int very often, for integers we know are not going to be too big, e.g., loop counters. Use plain old int for such things.
> ...
> You should not use the unsigned integer types such as uint32_t, unless there is a valid reason such as representing a bit pattern rather than a number, or you need defined overflow modulo 2^N. In particular, do not use unsigned types to say a number will never be negative. Instead, use assertions for this.

Collaborator Author @yshekel, Jan 18, 2024:

We discussed it in the daily meeting.
I personally prefer unsigned values when the value should not be negative, because it tells the reader that the value should not be negative (and actually uint32_t, as Hadar mentioned, since it's more portable and avoids nasty bugs: sizeof(int) is not guaranteed to be 4 on all systems).

Two notes:
(1) Google style suggests asserting that values are not negative instead of using unsigned int. I am not sure that is applicable to kernel code.
(2) when using std containers like std::vector, the size() method returns an unsigned value, so the compiler will warn about the comparison if the loop counter is signed (see the example below). Iterators are probably the better design pattern in those cases, but that is not really the point here.

Anyway, unless you object, I will leave it as is for now. Note that these are not exposed APIs but internal implementation.
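
To illustrate note (2), here is a small standalone example (not from the PR); with warnings enabled (e.g. -Wall) compilers typically flag the first loop:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
  std::vector<int> v = {1, 2, 3};

  // v.size() returns an unsigned type (std::size_t); comparing it against a
  // signed loop counter typically triggers a signed/unsigned comparison warning.
  for (int i = 0; i < v.size(); i++)
    std::printf("%d\n", v[i]);

  // An unsigned counter (or iterators / range-for) avoids the warning.
  for (std::size_t i = 0; i < v.size(); i++)
    std::printf("%d\n", v[i]);

  return 0;
}
```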

@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch 2 times, most recently from 0260d6e to ae37962 on January 18, 2024 11:15
@yshekel yshekel force-pushed the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch from ae37962 to 01ea7b6 on January 18, 2024 11:37
@yshekel yshekel merged commit 56fcd2e into dev Jan 18, 2024
11 of 12 checks passed
@yshekel yshekel deleted the refactor/yshekel/consolidate_msm_with_batch_msm_impl branch January 18, 2024 12:20